@Calvin-Xu (Member)

Description

Addresses #2109

Implements Gated Attention per https://github.com/qiuzh20/gated_attention and sweeps to find optimal LR scaling factor.

This has been basically ready for a while, but we are rerunning the 1.2B track on v5p-32 to get good hardware-FLOPs data points, and we are having trouble launching v5p-32 specifically on our clusters. Putting up a draft PR to let people know this is being worked on, lest effort be duplicated (Will almost did).
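For context, the core idea of the gated-attention variant (per the qiuzh20/gated_attention reference) is an elementwise sigmoid gate applied to the attention output before the output projection. A minimal NumPy sketch with illustrative names and shapes, not this PR's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attn_output(x, attn_out, w_gate):
    """Apply a sigmoid output gate to the attention result.

    x:        (seq, d_model)             residual-stream input to the block
    attn_out: (seq, n_heads * head_dim)  concatenated SDPA output
    w_gate:   (d_model, n_heads * head_dim)  separate gate projection
    """
    gate = sigmoid(x @ w_gate)   # elementwise gate in (0, 1)
    return attn_out * gate       # gated output, fed to the output projection

# Tiny example: with zero gate weights the gate is exactly 0.5 everywhere,
# so the output is attn_out scaled by 0.5.
rng = np.random.default_rng(0)
seq, d_model, nh_hd = 4, 8, 8
x = rng.normal(size=(seq, d_model))
attn_out = rng.normal(size=(seq, nh_hd))
w_gate = np.zeros((d_model, nh_hd))
y = gated_attn_output(x, attn_out, w_gate)
```

The gate is computed from the block input with its own projection here; where that projection lives (separate matmul vs. fused into QKV) is exactly the design point discussed below.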

Checklist

  • You ran `uv run python infra/pre-commit.py --all-files` to lint/format your code
  • You ran `pytest` to test your code

@Calvin-Xu changed the title from "Calvin/gated attention" to "Gated Attention & Scaling Speedruns" on Jan 5, 2026
@Calvin-Xu (Member, Author)

For some reason, not having the separate gate projection seems to be much more expensive FLOP-wise here, even after obtaining the results on v5p-32. Will revert to the separate gate projection and redo the 2x and 2.5x sweeps.
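On paper, a separate gate projection strictly adds parameters and matmul FLOPs over fusing the gate into an existing projection, so the hardware-FLOPs observation above is counterintuitive and presumably comes down to XLA padding/dimension alignment rather than the raw math. A back-of-the-envelope count, with made-up dimensions rather than the PR's actual 1.2B config:

```python
# Per-token matmul FLOPs for the attention projections, counting a
# multiply-accumulate as 2 FLOPs. Dimensions are illustrative only.
d_model, n_heads, head_dim = 2048, 16, 128
d_attn = n_heads * head_dim

qkv_flops = 2 * d_model * 3 * d_attn   # fused QKV projection
out_flops = 2 * d_attn * d_model       # output projection
gate_flops = 2 * d_model * d_attn      # separate sigmoid-gate projection

baseline = qkv_flops + out_flops
gated = baseline + gate_flops
print(f"gate projection overhead vs attention projections: "
      f"{gate_flops / baseline:.1%}")
```

With these shapes the separate gate is a fixed 25% overhead on the attention projections (much less as a fraction of total model FLOPs once MLPs are included), so any measured regression from fusing it away is likely a layout/padding effect on the TPU rather than an arithmetic one.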

Commit: "let XLA do its magic & not mess up dimension size alignment"